Large vocabulary ASR for spontaneous Czech in the MALACH project
Abstract
This paper describes large-vocabulary continuous speech recognition (LVCSR) research into the automatic transcription of spontaneous Czech speech in the MALACH (Multilingual Access to Large Spoken Archives) project. The project aims to improve access to the large multilingual spoken archives collected by the Survivors of the Shoah Visual History Foundation (VHF) (www.vhf.org) by advancing the state of the art in automated speech recognition. We describe a baseline ASR system and discuss the language modeling problems that arise because Czech is a highly inflectional language that also exhibits diglossia between its written and spontaneous forms. The task is made harder by heavily accented, emotional, and disfluent speech, along with frequent switching between languages. To overcome the limited amount of relevant language model data, we use statistical techniques to select an appropriate training corpus from a large unstructured text collection, resulting in significant reductions in word error rate.
Similar resources
Automatic Transcription of Czech Language Oral History in the MALACH Project: Resources and Initial Experiments
In this paper we describe the initial stages of the ASR component of the MALACH (Multilingual Access to Large Spoken Archives) project. This project will attempt to provide improved access to the large multilingual spoken archives collected by the Survivors of the Shoah Visual History Foundation (VHF) by advancing the state of the art in automated speech recognition. In order to train the ASR s...
Fast Phonetic/Lexical Searching in the Archives of the Czech Holocaust Testimonies: Advancing Towards the MALACH Project Visions
In this paper we describe a system for fast phonetic/lexical searching in the large archives of the Czech Holocaust testimonies. The developed system is the first step towards fulfilling the MALACH project visions [1,2], at least with respect to easier and faster access to the Czech part of the archives. More than one thousand hours of spontaneous, accented and highly emotional speech of Czech h...
Issues in Annotation of the Czech Spontaneous Speech Corpus in the MALACH project
The paper presents the issues encountered in processing spontaneous Czech speech in the MALACH project. Specific problems connected with the frequent occurrence of colloquial words in spontaneous Czech are analyzed; a partial solution is proposed and experimentally
Towards automatic transcription of large spoken archives - English ASR for the MALACH project
Digital archives have emerged as the pre-eminent method for capturing the human experience. Before such archives can be used efficiently, their contents must be described. The NSF-funded MALACH project aims to provide improved access to large spoken archives by advancing the state-of-the-art in automated speech recognition (ASR), Information Retrieval (IR) and related technologies [1, 2] for mu...